interactive application
SpeakStream: Streaming Text-to-Speech with Interleaved Data
Bai, Richard He, Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep
The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and run on complete utterances, introduce unacceptable delays when coupled with streaming LLM outputs, even with optimized inference speeds. This is particularly problematic for creating responsive conversational agents, where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art first-token latency while maintaining the quality of non-streaming TTS systems. Our demo website is available at https://apple.github.io/speakstream-demo. Index Terms: text-to-speech, speech synthesis, streaming. Recent years have witnessed a surge of interest in speech interfaces for large language models (LLMs).
DipMe: Haptic Recognition of Granular Media for Tangible Interactive Applications
Wang, Xinkai, Zhang, Shuo, Zhao, Ziyi, Zhu, Lifeng, Song, Aiguo
While tangible user interfaces have shown their power in naturally interacting with rigid or soft objects, users cannot conveniently use different types of granular materials as interaction media. We introduce DipMe, a smart device that recognizes types of granular media in real time and can be used to connect granular materials in the physical world with various virtual content. Unlike vision-based solutions, we propose a dip operation for our device and exploit the resulting haptic signals to recognize different types of granular materials. Using modern machine learning tools, we find that the haptic signals from different granular media are distinguishable by DipMe. With this online granular object recognition, we build several tangible interactive applications, demonstrating the effectiveness of DipMe in perceiving granular materials and its potential for developing tangible user interfaces with granular objects as a new medium.
Build a Named Entity Recognition App with Streamlit
In my previous article, we fine-tuned a Named Entity Recognition (NER) model trained on the wnut_17[1] dataset. In this article, we show step by step how to integrate this model with Streamlit and deploy it using Hugging Face Spaces. The goal of this app is to tag input sentences in real time, per user request. Keep in mind that, unlike trivial ML models, a large language model is tricky to deploy on Streamlit. We also address those challenges.
Beating Common Sense into Interactive Applications
Lieberman, Henry, Liu, Hugo, Singh, Push, Barry, Barbara
A long-standing dream of artificial intelligence has been to put commonsense knowledge into computers, enabling machines to reason about everyday life. However, it is widely assumed that the use of common sense in interactive applications will remain impractical for years, until collections of commonsense knowledge can be considered sufficiently complete and commonsense reasoning sufficiently robust. Recently, at the Massachusetts Institute of Technology's Media Laboratory, we have had some success in applying commonsense knowledge in a number of intelligent interface agents, despite the admittedly spotty coverage and unreliable inference of today's commonsense knowledge systems.